Abstract

Feature selection is a technique to screen out less important features. Many existing
supervised feature selection algorithms use redundancy and relevancy as the main criteria
to select features. However, feature interaction, potentially a key characteristic in
real-world problems, has not received much attention. As an attempt to take feature interaction
into account, we propose $\ell_1$-LSMI, an $\ell_1$-regularization based algorithm that maximizes a
squared-loss variant of mutual information between selected features and outputs.
Numerical results show that $\ell_1$-LSMI performs well in handling redundancy, detecting non-linear
dependency, and considering feature interaction.

1 Introduction

Recently, solving real-world complex problems with supervised learning techniques has
become more and more common. In supervised learning, using all variables as input to a
learning algorithm works well when the number of variables is limited. However, when the
number of variables is large (e.g., gene expression-based patient classification), using all
variables in the learning process could lead to overfitting and a model interpretability problem
[Zhao et al., 2010].

To overcome these problems, feature selection techniques are useful. Feature selection aims 
at removing unnecessary variables and retaining only relevant variables for the target supervised 
learning task. Many previous studies [Saeys et al., 2007, Suzuki et al., 2009] showed that
feature selection is useful in finding relevant variables to gain more insight into the data. Moreover,
the generalization ability of the learned model can be improved through the removal of noisy
variables [Peng et al., 2005, Langley, 1994].

Two conflicting criteria which are commonly used to select features are relevancy and re- 
dundancy. Features are relevant if they can explain outputs. Features are redundant if they are 
similar. It is trivial that more features are more likely to explain outputs well. However, more 
features are also more prone to be redundant [Peng et al., 2005, Zhao et al., 2010]. 

Feature interaction is another important criterion to consider. Feature interaction is
a situation in which two or more weak features can explain the output well in the context of 
each other, even though each of them alone may not be explanatory. It is one of the key char- 
acteristics in real-world problems. To detect a group of interacting features, it is necessary to 
simultaneously consider all features. This is because, by definition, considering features indi- 
vidually will not reveal any relevancy to the output. Due to this difficulty, feature interaction 
has not received much attention from the community. 

In this research, instead of focusing on only the relevancy and the redundancy as many 
previous studies did, we also take into consideration the interaction among features. We
propose $\ell_1$-LSMI, an $\ell_1$-regularization based algorithm that maximizes a squared-loss variant of
mutual information between selected features and outputs. We also experimentally compare
the proposed method with several state-of-the-art feature selection algorithms on both artificial
and real data. Numerical results show that $\ell_1$-LSMI performs well in handling redundancy,
detecting non-linear dependency, and considering feature interaction.

The structure of this paper is as follows. We formulate our feature selection problem in 
Sect. 2. Then we describe optimization strategies commonly used in practice in Sect. 3, as 
well as several feature quality measures in Sect. 4. We argue that, among the listed strategies,
$\ell_1$-regularization based feature weighting is the best choice if we take into account the balance
between computation and consideration of features. As a feature quality measure, we show
that squared-loss mutual information (SMI) [Suzuki et al., 2009] possesses various desirable
properties. Based on this argument, in Sect. 5, we propose to combine $\ell_1$-regularization and
SMI, which we refer to as $\ell_1$-LSMI. Experiments on artificial and real data are described in
Sect. 6. Finally, we conclude the paper in Sect. 7.
2 Problem Formulation

A formal description of a supervised feature selection problem is as follows. Assume we have
an input data matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ and output data vector $\mathbf{Y} \in \mathbb{R}^{n}$, where m is the number
of features and n is the sample size. $\mathbf{X}$ and $\mathbf{Y}$ are realizations of the random variables
$X = (X_1, \ldots, X_m)$ and $Y$, respectively. Given the desired number of features k, supervised
feature selection attempts to find a subset of features identified by the set of feature indices
$\mathcal{I} \subseteq \{1, \ldots, m\}$, such that the underlying feature quality measure $f$ is maximized. Formally,
this can be formulated as an optimization problem as

$$\max_{\mathcal{I} \subseteq \{1,\ldots,m\}} f(\mathbf{X}_{\mathcal{I}}, \mathbf{Y}) \quad \text{subject to } |\mathcal{I}| = k,$$

where $|\cdot|$ denotes the set cardinality, and $\mathbf{X}_{\mathcal{I}}$ denotes the data matrix $\mathbf{X}$ retaining only rows
indexed by $\mathcal{I}$.

In general, $f$ can be any function which can quantify the desired characteristics of the
selected features. A popular choice for $f$ is the classification accuracy of a chosen classifier
[Kohavi and John, 1997]. While the selected features $\mathcal{I}$ obtained from this approach can yield
a good classification accuracy, they are only specifically fit to the predictor in use. As a
result, an objective interpretation of $\mathcal{I}$ may be difficult [Guyon and Elisseeff, 2003]. In this work,
we opt to focus on feature selection algorithms which are independent of a predictor for wide
applicability.

In practice, searching for a good feature subset to maximize $f$ in a reasonable amount of
time can be challenging. In fact, finding the global optimal feature subset is known to be
NP-hard [Weston et al., 2003, Masaeli et al., 2010]. One way to guarantee that we can obtain the
global optimal subset is to perform an exhaustive search over all possible subsets. However,
since there are $2^m$ possible subsets in total, this approach is impractical for large m. Clearly, a
good optimization strategy is needed to efficiently explore the subset space.

As shown above, optimization strategies and feature quality measures are two important 
research issues in feature selection. We describe standard optimization strategies in Sect. 3, and 
popular feature quality measures in Sect. 4. 

3 Optimization Strategies 

The optimization strategy defines how to search for a good feature subset. The complexity
of these optimization strategies ranges, with respect to the number of features m, from linear
(feature ranking) to exponential (exhaustive search). Optimization strategies in general attempt
to find features which have high relevancy to the output. Higher complexity in some strategies
follows from the fact that feature redundancy is also taken into consideration. We start the
discussion with the fast feature ranking technique, which does not consider feature redundancy.
3.1 Feature Ranking

Feature ranking is one of the simplest feature optimization strategies. Given m features
$\{X_1, \ldots, X_m\}$, the feature ranking approach solves the optimization problem of the form

$$\max_{\mathcal{I} \subseteq \{1,\ldots,m\}} \sum_{i \in \mathcal{I}} f(X_i, Y) \quad \text{subject to } |\mathcal{I}| = k.$$

To solve this problem, we calculate $f(X_i, Y)$ for $i \in \{1, \ldots, m\}$, rank the $X_i$ in descending order,
and then select the top k features. The notable feature selection algorithms based on this
ranking scheme are Pearson correlation ranking, SPEC [Zhao and Liu, 2007], the Laplacian score
[He et al., 2006], and the mutual information score [Suzuki et al., 2009].

Although simple and fast, feature ranking considers only the relevancy of features. Eval- 
uating each feature individually does not take into account the redundancy among features. 
Specifically, if there are many relevant features which are similar in nature, all of them will be 
ranked top. This is not desirable since having many similar features is usually as good as having 
just one. In other words, k best features are not the best k features [Peng et al., 2005]. 
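
The redundancy problem of ranking can be illustrated with a small sketch. It assumes the absolute Pearson correlation as the quality measure and synthetic data; the function name `rank_features` and the data-generating choices are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def rank_features(X, y, k):
    """Return the indices of the k features with the largest |Pearson correlation| with y.

    X has shape (m, n): one row per feature and one column per sample,
    following the m-by-n data matrix convention of the text.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    scores = np.abs(Xc @ yc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(yc))
    order = np.argsort(-scores, kind="stable")   # descending order of the scores
    return order[:k], scores

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + 0.5 * x2
# Rows 0 and 1 are near-duplicates; row 2 carries the complementary information; row 3 is noise.
X = np.vstack([x1, x1 + 0.01 * rng.normal(size=n), x2, rng.normal(size=n)])

top2, scores = rank_features(X, y, k=2)
print(sorted(top2.tolist()), np.round(scores, 2))
```

With k = 2, ranking picks the two near-duplicate rows and misses the complementary feature in row 2, which is the behavior the paragraph above describes.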

3.2 Sequential Search 

To take feature redundancy into account, the popular sequential search [Kohavi and John, 1997,
Song et al., 2007] can be used. It comes with two variants: forward and backward search.
Forward search works iteratively by maintaining the currently selected features $\mathcal{X}_t$. At each
step t, $\mathcal{X}_t$ is updated with

$$\mathcal{X}_t \leftarrow \mathcal{X}_{t-1} \cup \{X^*\},$$

where $X^* = \operatorname{argmax}_{X_i} f(\mathcal{X}_{t-1} \cup \{X_i\}, Y)$ and $\mathcal{X}_0 = \emptyset$. The backward search works similarly except
that $\mathcal{X}_0$ contains the full feature set. At each step, a feature which reduces $f$ the least is removed.

A potential drawback of the sequential search is its greedy search nature which is
independent of k. That is, the search paths are nested for different values of k. Specifically, it is
decremental for the backward search, and incremental for the forward search. The result is that,
for the backward search, once a feature is removed, it will never be considered again. Likewise,
for the forward search, once a feature is added, it will never be removed even if it is found to be
redundant at later iterations.
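
The greedy update of the forward search can be sketched as follows. The set-level quality measure here is a stand-in (the R^2 of a least-squares fit), chosen only to keep the example short; `forward_search` and `r2_of_subset` are hypothetical names, and any predictor-independent measure from Sect. 4 could be plugged in instead.

```python
import numpy as np

def forward_search(X, y, k, score):
    """Greedy forward selection over the rows (features) of X.

    score(X_subset, y) is a hypothetical set-level quality measure f.
    Once a feature is added it is never removed, so the paths are nested in k.
    """
    selected = []
    remaining = list(range(X.shape[0]))
    for _ in range(k):
        best = max(remaining, key=lambda i: score(X[selected + [i]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def r2_of_subset(Xs, y):
    """Stand-in for f: R^2 of a least-squares fit of y on the chosen rows (plus a bias)."""
    A = np.vstack([Xs, np.ones(Xs.shape[1])]).T
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 + 0.5 * x2
X = np.vstack([x1, x1 + 0.01 * rng.normal(size=n), x2, rng.normal(size=n)])

selected = forward_search(X, y, k=2, score=r2_of_subset)
print(selected)
```

Because the second step evaluates candidates in the context of the first pick, the near-duplicate row adds almost nothing and the complementary feature (row 2) is chosen instead.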

3.3 Feature Weighting 

Feature weighting [Tibshirani, 1996, Zhu et al., 2004, Li et al., 2006, Liu et al., 2009] is an
approach which can search for features with a continuous optimization. Formally, the feature
weighting approach attempts to find a feature weight vector $w \in \mathbb{R}^m$ which is the solution of the
following optimization problem:

$$\max_{w} f(\operatorname{diag}(w)\mathbf{X}, \mathbf{Y}) \quad \text{subject to } \|w\|_1 \le r, \tag{2}$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm, $\operatorname{diag}(w)$ is a diagonal matrix with $w$ placed along its
diagonal, and $r > 0$ is the tuning parameter for the radius of the $\ell_1$-ball. It is known that if r is
sufficiently small, then the solution tends to be on a vertex of the simplex, which makes $w$
sparse [Tibshirani, 1996]. Features can then be selected according to the non-zero coefficients
of the solution $w$. In fact, observations reveal that the number of non-zero coefficients tends to
increase as r increases. So, a simple bisection method may be used to search for the value of r
which gives k features.

Unlike the sequential search, the feature weighting approach incorporates k into the problem 
through r from the beginning. So, the solutions for different values of k are not necessarily 
nested. This characteristic is particularly useful when there are multiple optimal feature subsets 
of different sizes which are disjoint. 



4 Feature Quality Measures 

In this section, we describe a number of feature quality measures commonly used in practice. A 
feature quality measure is a criterion which indicates how good the selected features are, and is 
the counterpart of the optimization strategy. Here, we focus on predictor-independent criteria. 



4.1 Pearson Correlation 

Pearson correlation is a well-known univariate statistical quantity which can be used to measure
a linear dependency between two random variables X and Y. It is defined as

$$\rho(X, Y) := \frac{\operatorname{cov}(X, Y)}{\sigma(X)\,\sigma(Y)}, \tag{3}$$

where $\operatorname{cov}(X, Y)$ denotes the covariance between X and Y, and $\sigma(X)$ and $\sigma(Y)$ are the population
standard deviations of X and Y, respectively.

Although the independence of X and Y implies $\rho = 0$, the converse is not necessarily true
since the correlation is capable of detecting only a linear dependency. An example would be a
quadratic dependence $Y = X^2$ with X distributed symmetrically around zero, which gives $\rho = 0$
due to the cancellation of the negatively and the positively correlated components.
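
A quick numerical check of the quadratic example, as a minimal sketch: the grid of x values is an arbitrary choice, and the second case (x restricted to positive values) is added only for contrast.

```python
import numpy as np

# x symmetric around zero: y is a deterministic function of x, yet the correlation vanishes.
x_sym = np.linspace(-1.0, 1.0, 201)
rho_sym = np.corrcoef(x_sym, x_sym ** 2)[0, 1]

# Same function on a range that is not symmetric around zero: the dependence looks almost linear.
x_pos = np.linspace(0.0, 2.0, 201)
rho_pos = np.corrcoef(x_pos, x_pos ** 2)[0, 1]

print(abs(rho_sym) < 1e-8, rho_pos > 0.9)
```

The zero correlation therefore says nothing about the strength of the dependence; it only reflects that the linear component cancels out.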

For a feature selection purpose, $|\rho|$ can be used to rank features. There are many feature
selection algorithms based on Pearson correlation [Rodriguez-Lujan et al., 2010, Hall, 2000,
Peng et al., 2005].



4.2 Hilbert-Schmidt Independence Criterion 

The Hilbert-Schmidt independence criterion (HSIC) [Gretton et al., 2005] is a multivariate de- 
pendence measure which can detect a non-linear dependency, and does not require a density 
estimation. 
The formal definition of HSIC is given as follows. Let $\mathcal{D}_X$ and $\mathcal{D}_Y$ be the domains of X
and Y. Define a mapping $\phi(x) \in \mathcal{F}$ from all $x \in \mathcal{D}_X$ to the feature space $\mathcal{F}$ in such a way
that the inner product of points in $\mathcal{F}$ is given by a kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$. This
can be achieved if $\mathcal{F}$ is a reproducing kernel Hilbert space on $\mathcal{D}_X$ [Aronszajn, 1950]. Similarly,
define another reproducing kernel Hilbert space $\mathcal{G}$ for $\mathcal{D}_Y$ with feature map $\psi$ and kernel
$l(y, y') = \langle \psi(y), \psi(y') \rangle$. Then, the cross-covariance operator [Fukumizu et al., 2004] associated
with the joint probability $p_{xy}$ is a linear operator $C_{xy}$ defined as

$$C_{xy} := \mathbb{E}_{x,y}\left[ (\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y) \right],$$

where $\mu_x := \mathbb{E}_x[\phi(x)]$, $\mu_y := \mathbb{E}_y[\psi(y)]$, and $\otimes$ is the tensor product. HSIC is defined as the
squared Hilbert-Schmidt norm of the cross-covariance operator

$$\mathrm{HSIC}(p_{xy}, \mathcal{F}, \mathcal{G}) := \|C_{xy}\|_{\mathrm{HS}}^2,$$

which could be expressed in terms of kernels [Gretton et al., 2005] as

$$\mathrm{HSIC}(p_{xy}, \mathcal{F}, \mathcal{G}) = \mathbb{E}_{x,x',y,y'}[k(x,x')\,l(y,y')] + \mathbb{E}_{x,x'}[k(x,x')]\,\mathbb{E}_{y,y'}[l(y,y')] - 2\,\mathbb{E}_{x,y}\left[ \mathbb{E}_{x'}[k(x,x')]\,\mathbb{E}_{y'}[l(y,y')] \right].$$

Here, $\mathbb{E}_{x,x',y,y'}[k(x,x')\,l(y,y')]$ is the expectation over independent pairs $(x, y)$ and $(x', y')$ drawn from
$p_{xy}$. Given an i.i.d. paired sample $S = \{(x_i, y_i)\}_{i=1}^{n}$, an empirical estimator of HSIC is given by

$$\mathrm{HSIC}(S, \mathcal{F}, \mathcal{G}) = \frac{1}{(n-1)^2} \operatorname{tr}(KHLH), \tag{4}$$

where $K, L, H \in \mathbb{R}^{n \times n}$, $(K)_{ij} := k(x_i, x_j)$, $(L)_{ij} := l(y_i, y_j)$, and $H := I_n - \mathbf{1}\mathbf{1}^\top / n$ (centering
matrix). It was also shown that, if $k$ and $l$ are universal kernels (e.g., Gaussian kernels)
[Steinwart, 2001], then $\mathrm{HSIC}(p_{xy}, \mathcal{F}, \mathcal{G}) = 0$ if and only if X and Y are independent. So, HSIC
can also be used as a dependence measure.

In spite of the strong theoretical properties of HSIC, there is no known objective criterion
for model selection of the kernel functions $k$ and $l$. A popular heuristic choice is to use a
Gaussian kernel with its width set to the median of the pairwise distance of the data points
[Schölkopf and Smola, 2002].
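
The empirical estimator in Eq. (4) with the median heuristic can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in the experiments; the helper names (`sq_dists`, `gaussian_gram`, `median_width`, `hsic`) and the synthetic data are assumptions.

```python
import numpy as np

def sq_dists(Z):
    """Pairwise squared Euclidean distances between the rows of Z, shape (n, d)."""
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

def gaussian_gram(Z, width):
    """Gram matrix of a Gaussian kernel with the given width."""
    return np.exp(-sq_dists(Z) / (2.0 * width ** 2))

def median_width(Z):
    """Median heuristic: the median of the pairwise distances."""
    iu = np.triu_indices(Z.shape[0], k=1)
    return np.sqrt(np.median(sq_dists(Z)[iu]))

def hsic(x, y):
    """Empirical HSIC, tr(KHLH) / (n - 1)^2, for samples x (n, dx) and y (n, dy)."""
    n = x.shape[0]
    K = gaussian_gram(x, median_width(x))
    L = gaussian_gram(y, median_width(y))
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
dep = hsic(x, x ** 2)                        # non-linear dependence
indep = hsic(x, rng.normal(size=(200, 1)))   # independent sample
print(dep, indep)
```

Unlike the Pearson correlation, the estimate is clearly larger for the quadratic dependence than for the independent sample.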



4.3 Mutual Information

In information theory, mutual information [Cover and Thomas, 2006] is an important quantity
which can be used to detect a general non-linear dependency between two random variables. It
has been widely used as the criterion for feature selection [Peng et al., 2005, Suzuki et al., 2008,
Rodriguez-Lujan et al., 2010] as well as feature extraction [Torkkola, 2003]. Mutual
information is defined as

$$I(X, Y) := \iint \log\left( \frac{p_{xy}(x, y)}{p_x(x)\,p_y(y)} \right) p_{xy}(x, y)\, dx\, dy, \tag{5}$$

which is the Kullback-Leibler divergence [Kullback and Leibler, 1951] from $p_{xy}(x, y)$ to
$p_x(x)\,p_y(y)$. Mutual information is a measure of dependence in the sense that it is always
non-negative, symmetric ($I(X, Y) = I(Y, X)$), and vanishes if and only if X and Y are independent,
i.e., $p_{xy}(x, y) = p_x(x)\,p_y(y)$.

Even though mutual information is a powerful multivariate measure, accurate estimation
of the densities $p_{xy}$, $p_x$, and $p_y$ is difficult in the high-dimensional case. A recent approach which
avoids taking the ratio of estimated densities by directly modeling the density ratio
$\frac{p_{xy}(x, y)}{p_x(x)\,p_y(y)}$ is Maximum Likelihood Mutual Information (MLMI) [Suzuki et al., 2008]. Although MLMI
was demonstrated to be accurate, its estimation is computationally rather expensive due to the
existence of the logarithm function.



4.4 Squared-loss Mutual Information

Another mutual information variant which has received much attention recently is
Squared-loss Mutual Information (SMI) [Suzuki et al., 2009, Suzuki and Sugiyama, 2012,
Hachiya and Sugiyama, 2010, Suzuki and Sugiyama, 2011] defined as

$$I_s(X, Y) := \frac{1}{2} \iint \left( \frac{p_{xy}(x, y)}{p_x(x)\,p_y(y)} - 1 \right)^2 p_x(x)\,p_y(y)\, dx\, dy. \tag{6}$$

SMI is based on the f-divergence [Ali and Silvey, 1966, Csiszar, 1967] with a squared loss (also
known as the Pearson divergence, [Liese and Vajda, 2006]), as opposed to the ordinary mutual
information which is based on the f-divergence with a log loss (Kullback-Leibler divergence,
[Kullback and Leibler, 1951]). Note that $I_s(X, Y) = I_s(Y, X)$, $I_s(X, Y) \ge 0$, and $I_s(X, Y) = 0$ if
and only if $p_{xy}(x, y) = p_x(x)\,p_y(y)$, just like the ordinary mutual information. Therefore, SMI
can also be used as a measure of dependence between X and Y.
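
The properties shared by Eq. (5) and Eq. (6) can be checked on small discrete distributions. The sketch below uses the discrete analogue of both definitions (sums in place of integrals), which is an assumption made for illustration; it also assumes that all marginal probabilities are strictly positive.

```python
import numpy as np

def mi_and_smi(pxy):
    """Mutual information and squared-loss MI of a discrete joint distribution.

    pxy[i, j] = P(X = i, Y = j); all marginal probabilities are assumed positive.
    """
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    prod = px * py                      # product of the marginals
    ratio = pxy / prod                  # density ratio
    mask = pxy > 0                      # 0 * log(0) is treated as 0
    mi = np.sum(pxy[mask] * np.log(ratio[mask]))
    smi = 0.5 * np.sum((ratio - 1.0) ** 2 * prod)
    return mi, smi

mi_ind, smi_ind = mi_and_smi(np.outer([0.5, 0.5], [0.3, 0.7]))   # independent X and Y
mi_dep, smi_dep = mi_and_smi([[0.5, 0.0], [0.0, 0.5]])           # Y is a copy of X
print(mi_ind, smi_ind, mi_dep, smi_dep)
```

Both measures vanish for the independent table and are positive for the dependent one; for the copy example the ratio is 2 on the diagonal and 0 elsewhere, giving an SMI of 0.5 and an MI of log 2.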

SMI can be estimated by directly modeling the ratio $g^*(x, y) = \frac{p_{xy}(x, y)}{p_x(x)\,p_y(y)}$ itself without going
through the estimation of the densities. The goal is to find a density ratio estimate $\widehat{g}(x, y)$ which
is as close to the true density ratio $g^*(x, y)$ as possible. Here, the estimation can be formulated
as a least-squares problem. That is, to find $\widehat{g}(x, y)$ such that its expected squared difference from
$g^*(x, y)$ is minimized:

$$\min_{g} \frac{1}{2} \iint \left( g(x, y) - g^*(x, y) \right)^2 p_x(x)\,p_y(y)\, dx\, dy. \tag{7}$$

Since finding $g$ over all measurable functions is not tractable [Suzuki and Sugiyama, 2012], the
model $g$ is restricted to be in a linear subspace $\mathcal{G}$ defined as

$$\mathcal{G} := \left\{ \alpha^\top \varphi(x, y) \mid \alpha = (\alpha_1, \ldots, \alpha_b)^\top \in \mathbb{R}^b \right\},$$

where $\alpha$ is the model parameter to be learned, and $\varphi(x, y) = (\varphi_1(x, y), \ldots, \varphi_b(x, y))^\top$ is a basis
function vector such that $\forall l,\ \varphi_l(x, y) \ge 0$. The basis also admits kernel functions which depend
on samples.

With $\mathcal{G}$, finding $\widehat{g}$ amounts to finding the optimal $\widehat{\alpha}$. By using an empirical approximation,
Eq. (7) can be written as

$$\min_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha + \frac{\lambda}{2} \alpha^\top \alpha \right], \tag{8}$$

where the term $\frac{\lambda}{2} \alpha^\top \alpha$ with a regularization parameter $\lambda \ge 0$ is included for a regularization
purpose, and

$$\widehat{H} := \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \varphi(x_i, y_j)\, \varphi(x_i, y_j)^\top, \qquad \widehat{h} := \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i, y_i).$$

By differentiating Eq. (8) with respect to $\alpha$ and equating it to zero, the solution $\widehat{\alpha}$ can be
computed analytically as

$$\widehat{\alpha} = (\widehat{H} + \lambda I)^{-1} \widehat{h},$$

where $I$ denotes the identity matrix. Finally, using $\widehat{\alpha}$, SMI in Eq. (6) can be estimated as

$$\widehat{I}_s = \frac{1}{2} \widehat{h}^\top \widehat{\alpha} - \frac{1}{2}. \tag{9}$$

The estimator in Eq. (9) is called Least-Squares Mutual Information (LSMI).
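
The analytic solution makes the estimator short to write down. The following is a minimal sketch under several assumptions: a product of Gaussian kernels as basis functions, centers drawn from the paired samples, and fixed values of the kernel width and the regularization parameter (no model selection). The names `gauss` and `lsmi` and the default values are illustrative.

```python
import numpy as np

def gauss(a, centers, sigma):
    """Gaussian kernel values between the rows of a (n, d) and the rows of centers (b, d)."""
    d2 = ((a[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lsmi(x, y, sigma=1.0, lam=1e-2, b=50, seed=0):
    """Least-squares estimate of SMI, following Eqs. (8) and (9).

    x: (n, dx), y: (n, dy). Assumed basis: phi_l(x, y) = k(x, x_c(l)) * k(y, y_c(l)),
    with the centers c(l) drawn without replacement from the sample indices.
    """
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    c = rng.choice(n, size=min(b, n), replace=False)
    Px = gauss(x, x[c], sigma)                      # (n, b)
    Py = gauss(y, y[c], sigma)                      # (n, b)
    # H_hat sums phi(x_i, y_j) phi(x_i, y_j)^T over all n^2 pairs; for a product basis
    # the double sum factorizes into an element-wise product of two b-by-b matrices.
    H = (Px.T @ Px) * (Py.T @ Py) / n ** 2
    h = (Px * Py).mean(axis=0)                      # average over the paired samples
    alpha = np.linalg.solve(H + lam * np.eye(H.shape[0]), h)
    return 0.5 * h @ alpha - 0.5

rng = np.random.default_rng(1)
x = rng.normal(size=(400, 1))
smi_dep = lsmi(x, x + 0.1 * rng.normal(size=(400, 1)))   # strongly dependent pair
smi_ind = lsmi(x, rng.normal(size=(400, 1)))             # independent pair
print(smi_dep, smi_ind)
```

Only a b-by-b linear system has to be solved, which is the computational advantage over the log-loss based estimator. In practice the width and the regularization parameter would be chosen by cross validation rather than fixed.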

LSMI possesses many good properties [Suzuki and Sugiyama, 2012]. For example, it has
an optimal convergence rate in n under a non-parametric setup. Also, LSMI is equipped with a
model selection criterion for determining $\varphi$ and $\lambda$. Model selection by K-fold cross validation
is described as follows. First, randomly split the samples $\{(x_i, y_i)\}_{i=1}^{n}$ into K (roughly) equal disjoint
subsets $\{S_k\}_{k=1}^{K}$. An estimator $\widehat{\alpha}_{S_{-k}}$ is then obtained using $S_{-k} := \{S_j\}_{j \ne k}$. Finally, the
approximation error for the held-out samples $S_k$ is computed. The procedure is repeated K times, and
the $(\varphi, \lambda)$ which minimizes the mean approximation error $\widehat{J}^{(K\text{-CV})}$ over the K folds is chosen.

5 Proposed Method 

In this section, we describe our proposed method. 

5.1 Motivations 

As mentioned previously, a number of factors make feature selection difficult, namely
non-linear dependency, feature interaction, and feature redundancy. Although
existing combinations of optimization strategies and measures can handle these problems, the
trade-off between the computational complexity and the obtained ability to deal with such issues is
not well balanced.

A summary of properties of common optimization strategies is shown in Table 1. Ranking
is very fast since it completely disregards feature redundancy and feature interaction, and
focuses on only feature relevancy. Forward search improves this by maintaining a set of selected
features, and greedily adding each feature to the set. This allows the forward search to deal with
feature redundancy by not adding a redundant feature to the set. Nevertheless, feature
interaction cannot be detected since features are not considered in the presence of each other. This is
why backward search comes into play by starting from the full feature set and iteratively
removing a feature instead. Although this scheme has a potential to detect interacting features, the
complexity goes from $O(m)$ to $O(m^2)$, which could be problematic when the number of features,
m, is large. Considering all strategies, an $\ell_1$-based approach seems to be the optimal choice
here. It offers a continuous optimization which is usually easier than a discrete optimization.
Also, since all features are considered simultaneously by optimizing their weights, it can take
into account feature redundancy and feature interaction.

A summary of properties of feature quality measures is shown in Table 2. PC is very efficient 
to compute. However, only linear dependency can be identified. HSIC can reveal a non-linear 
dependency. Nonetheless, it is unclear how to objectively choose the right kernel function. MI 
is another measure that is capable of detecting a nonlinear dependency but the existence of log 
causes computational inefficiency. It can be seen that SMI has balanced properties here. Not 
only is it able to capture a non-linear dependency, using a squared loss instead of a log loss also 
permits its estimator to have an analytic form, which can be efficiently computed. 

Table 3 shows the combinations of optimization strategies and feature quality measures.
Many of them have already been proposed in the past. Exhaustive search is marked impractical
since it is computationally intractable. PC is a univariate measure which considers one feature at
a time. Combining it with a feature-set optimization strategy (i.e., forward search, backward search,
or the $\ell_1$ approach) would degenerate back to a ranking approach. Hence, the combinations are marked
unreasonable.

It can be seen that the feature weighting with $\ell_1$-regularization is the best among the
optimization strategies. Also, SMI has the best balance among the listed feature quality measures.
We therefore propose to combine $\ell_1$-regularized feature weighting with SMI, which we call
$\ell_1$-LSMI.
Table 1: Summary of properties of optimization strategies. "disc." and "cont." denote "discrete"
and "continuous", respectively.

               Ranking   Forward   Backward   Exhaustive   $\ell_1$
Optimization   disc.     disc.     disc.      disc.        cont.
Complexity     $m$       $m$       $m^2$      $2^m$        $m$
Redundancy     ×         △         ○          ◎            ○
Interaction    ×         ×         ○          ◎            ○

×: Not considered, △: Weak, ○: Good, ◎: Excellent



Table 2: Summary of properties of feature quality measures.

                           PC           HSIC   MI   SMI
Non-linear Dependency      ×            ○      ○    ○
Model Selection            not needed   ×      ○    ○
Computational Efficiency   ◎            ○      ×    △

×: Not considered, △: Weak, ○: Good, ◎: Excellent



Table 3: Summary of combinations of optimization strategies and feature quality measures.

       Ranking                   Forward                                                 Backward                  Exhaustive   $\ell_1$
PC     ○ [Hall, 2000]            ×                                                       ×                         ×            ×
HSIC   -                         ○ [Song et al., 2007]                                   ○ [Song et al., 2007]     ×            △ [Masaeli et al., 2010]
MI     ○ [Suzuki et al., 2008]   ○                                                       ○                         ×            -
SMI    ○ [Suzuki et al., 2009]   ○ [Suzuki et al., 2009], [Hachiya and Sugiyama, 2010]   ○ [Suzuki et al., 2009]   ×            -

○: Method exists, △: Variation exists,
-: Method does not exist, ×: Method is unreasonable or impractical



5.2 Formulation of $\ell_1$-LSMI

$\ell_1$-LSMI attempts to find an m-dimensional sparse weight vector by solving the following
optimization problem:

$$\max_{w \in \mathbb{R}^m} \widehat{I}_s(\operatorname{diag}(w)\mathbf{X}, \mathbf{Y}) \quad \text{subject to } \mathbf{1}^\top w \le r, \quad w \ge 0, \tag{10}$$

where $\widehat{I}_s$ is the LSMI defined in Eq. (9), $r > 0$ is the radius of the $\ell_1$-ball, $\mathbf{1}$ is the m-dimensional
vector consisting of only 1's, and "$\ge$" in $w \ge 0$ is applied element-wise. Features are selected
according to the non-zero coefficients of the learned $w$. Here, since the sign of $w_i$ does not affect
the feature selection process, we only consider the positive orthant in $\mathbb{R}^m$. Thus, the constraint
$w \ge 0$ is imposed, and $\|w\|_1$ reduces to $\mathbf{1}^\top w$.
5.3 Advantages of $\ell_1$-LSMI

Using SMI allows detection of non-linear dependency between X and Y. Furthermore, by
combining it with the $\ell_1$-regularization feature weighting scheme, feature interaction is also
taken into account since all features are considered simultaneously. In general, the use of
$\ell_1$-regularization does not necessarily give an ability to deal with redundant features. That is, the
weights of all redundant features may all be high. This drawback of $\ell_1$-regularization is covered
by the use of SMI. Since adding a redundant feature to the selected subset does not increase
the SMI value (i.e., no new information), $\ell_1$-LSMI implicitly deals with the feature redundancy
issue by avoiding the inclusion of redundant features. This is achieved by simply maximizing
SMI between the weighted features and the output. The use of density-ratio estimation in
approximating SMI also helps avoid the density estimation problem, which is difficult when m is
large.

5.4 Solving $\ell_1$-LSMI

Here, we explain how we solve the $\ell_1$-LSMI optimization problem.

5.4.1 Algorithm Overview

Algorithm 1 is executed to find a k-feature subset by a binary-search-like scheme. Based on
the observation that the number of obtained features tends to increase as r increases, the idea
is to systematically vary r so that k features can be obtained. Starting from a low r, the
$\ell_1$-LSMI optimization problem is solved by iteratively performing gradient ascent and projection
(constraint satisfaction). If k features can be obtained from the current r, then they are returned.
Otherwise, r is doubled (starting from line 2 in Algorithm 1) until more than k features are obtained.
The value of r first found to give more than k features is denoted by $r_h$, and is assumed to
be the upper bound of the value of r which can give k features. The lower bound $r_l$ is then set
to $r_h / 2$, which gives strictly fewer than k features. The rest of the procedure (starting from line 12
in Algorithm 1) is to find $r \in (r_l, r_h)$ using a binary search scheme, so that k features can be
obtained. In each step of the search, Eq. (10) is solved using the middle point $r_m$ between $r_h$
and $r_l$. If k features cannot be found, $r_h$ or $r_l$ is updated accordingly. This halving procedure is
repeated until k features are found, or the time limit is reached.

In case a k-feature subset cannot be found, the obtained feature subsets $\mathcal{X}$ are sorted in
ascending order of three keys given by $\left( \big| |\mathcal{X}| - k \big|,\ |\mathcal{X}| - k,\ -\widehat{I}_s(\mathbf{X}_{\mathcal{X}}, \mathbf{Y}) \right)$. This means that the feature
subsets whose size is closest to k are put towards the head of the list. Between two sets whose
sizes are equally close to k, the smaller one is preferred (due to $|\mathcal{X}| - k$). If there are still many
such subsets, the ones with the highest $\widehat{I}_s(\mathbf{X}_{\mathcal{X}}, \mathbf{Y})$ are brought to the head of the list, where $\mathbf{X}_{\mathcal{X}}$ denotes the
data matrix $\mathbf{X}$ with only rows indexed by $\mathcal{X}$. In the end, the feature subset $\mathcal{X}$ at the head of the
list is selected.
Algorithm 1 Pseudo code of $\ell_1$-LSMI to search for a k-feature subset.

Require: k (desired number of features)
 1: r ← 0.1                                   // r is initially low
 2: repeat                                    // try to find an upper bound r_h
 3:     r ← 2r
 4:     w_0 ← randomly initialize a feasible w
 5:     X_r ← Solve Eq. (10) with (w_0, r)    // X_r: set of features obtained using r
 6:     if |X_r| = k then
 7:         return X_r
 8:     end if
 9: until |X_r| > k or time limit exceeded
10: r_h ← r
11: r_l ← r_h / 2
12: while time limit not exceeded do          // find r ∈ (r_l, r_h) which gives k features with a binary search
13:     r_m ← (r_h + r_l) / 2
14:     w_0 ← randomly initialize a feasible w
15:     X_{r_m} ← Solve Eq. (10) with (w_0, r_m)
16:     if |X_{r_m}| = k then
17:         return X_{r_m}
18:     else if |X_{r_m}| < k then
19:         r_l ← r_m
20:     else if |X_{r_m}| > k then
21:         r_h ← r_m
22:     end if
23: end while
24: S ← list of all X found so far, sorted in ascending order by (||X| - k|, |X| - k, -Î_s(X_X, Y))
25: return the first X in S
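
The control flow of Algorithm 1 can be sketched in Python as follows. The inner solver is abstracted as a callback: `solve(r)` is a hypothetical function that stands in for the random initialization plus the projected gradient ascent on Eq. (10) and returns the selected feature indices. `score` is likewise a hypothetical stand-in for the LSMI value of a subset, and `toy_solve` is a made-up solver used only to exercise the search.

```python
import time

def search_radius(solve, k, time_limit=10.0, r0=0.1, score=None):
    """Doubling followed by bisection on the radius r, mirroring Algorithm 1."""
    deadline = time.monotonic() + time_limit
    found = []
    r = r0
    while True:                               # lines 2-9: look for an upper bound r_h
        r *= 2.0
        subset = frozenset(solve(r))
        found.append(subset)
        if len(subset) == k:
            return subset
        if len(subset) > k or time.monotonic() > deadline:
            break
    r_h, r_l = r, r / 2.0
    while time.monotonic() <= deadline:       # lines 12-23: bisection on (r_l, r_h)
        r_m = (r_h + r_l) / 2.0
        subset = frozenset(solve(r_m))
        found.append(subset)
        if len(subset) == k:
            return subset
        if len(subset) < k:
            r_l = r_m
        else:
            r_h = r_m
    # lines 24-25: fall back to the subset whose size is closest to k,
    # preferring smaller subsets and then higher scores
    value = score if score is not None else (lambda s: 0.0)
    return min(found, key=lambda s: (abs(len(s) - k), len(s) - k, -value(s)))

def toy_solve(r):
    """Made-up solver: the number of selected features grows with r."""
    return set(range(int(3 * r)))

print(sorted(search_radius(toy_solve, k=5)))
```

When no radius yields exactly k features before the time limit, the final `min` reproduces the three-key ordering of lines 24 and 25.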



5.4.2 Basis Function Design

Estimation of SMI requires b basis functions. Here, we choose the basis functions to be in the
form of a product kernel defined as

$$\varphi_l(\operatorname{diag}(w)x, y) = \phi^x_l(\operatorname{diag}(w)x)\, \phi^y_l(y) \quad \text{for } l = 1, \ldots, b. \tag{11}$$

$\phi^x_l(\cdot)$ is defined to be the Gaussian kernel with width $\sigma$ centered at the sample indexed by $c(l)$, where
$c(l) \in \{1, \ldots, n\}$ is a randomly chosen sample index without overlap. The definition of $\phi^y_l(y)$
depends on the task. For a regression task, $\phi^y_l$ is also defined to be a Gaussian kernel.
For a C-class classification task in which $Y \in \{1, \ldots, C\}$, the delta kernel is used on Y, i.e.,
$\phi^y_l(y)$ takes 1 if $y = y_{c(l)}$, and 0 otherwise. Using these definitions, model selection for $(\varphi, \lambda)$ is
reduced to selecting $(\sigma, \lambda)$.

5.4.3 Optimization 

Given an initial point $w_0$ and the radius r, the $\ell_1$-LSMI optimization problem is simply solved by
gradient ascent. To guarantee the feasibility, the updated $w$ is projected onto the positive orthant
of the constrained $\ell_1$-ball in each iteration. The projection can be carried out by first projecting
$w$ onto the positive orthant with $\max(w, 0)$, where the max function is applied element-wise.
This is then followed by a projection onto the $\ell_1$-ball which can be carried out in $O(m)$ time
[Duchi et al., 2008].
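
The two-stage projection can be sketched as below. Note that this is the simpler sort-based variant with $O(m \log m)$ cost, not the linear-time algorithm cited above; the function name `project_l1_nonneg` is an illustrative assumption.

```python
import numpy as np

def project_l1_nonneg(v, r):
    """Euclidean projection of v onto {w : w >= 0, sum(w) <= r}.

    Sort-based O(m log m) variant, used here for brevity.
    """
    w = np.maximum(v, 0.0)                 # step 1: project onto the positive orthant
    if w.sum() <= r:
        return w                           # already inside the l1-ball
    u = np.sort(w)[::-1]                   # step 2: find the soft threshold theta
    css = np.cumsum(u)
    j = np.arange(1, u.size + 1)
    rho = np.nonzero(u - (css - r) / j > 0)[0][-1]
    theta = (css[rho] - r) / (rho + 1)
    return np.maximum(w - theta, 0.0)      # shift and clip so that the entries sum to r

w_proj = project_l1_nonneg(np.array([0.5, -1.0, 2.0, 0.3]), r=1.0)
print(w_proj)
```

With a small radius the projection zeroes out all but the dominant coordinate, which is how the sparsity of $w$ arises during the projected gradient ascent.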

In practice, there are many more sophisticated methods for solving Eq. (10), e.g., projected
Newton-type methods [Lee et al., 2006, Schmidt et al., 2007]. These methods generally
converge super-linearly, and are faster (in terms of the convergence rate) than ordinary gradient
ascent algorithms which converge linearly. However, the notion of convergence does not take
into account the number of function evaluations. In general, methods with a good convergence
rate rely on a large number of function evaluations per iteration, e.g., performing line search to
find a good step size. In our case, function evaluation is expensive since model selection for
$(\sigma, \lambda)$ has to be performed. It turns out that using a more sophisticated solver may take more
time to actually solve the problem even though the convergence rate is better. So, we decided to
simply use a gradient ascent algorithm to solve the problem. Additionally, to further improve
the computational efficiency, model selection is performed every five iterations, instead of every
iteration. This is based on the fact that, in each iteration, $w$ is not significantly altered. Hence,
it makes sense to assume that the selected $(\sigma, \lambda)$ from the previous iteration are approximately
correct.

6 Experiments 

In this section, we report experimental results. 

6.1 IMethods to be Compared 

We compare the performance of the following feature selection algorithms: 
• PC (Pearson correlation ranking). 
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• F-HSIC (forward search with HSIC). 

• F-LSMI (forward search with LSMI) [Hachiya and Sugiyama, 2010]. 

• B-HSIC (backward search with HSIC) [Song et al., 2007]. 

• B-LSMl (backward search with LSMI). 

• ^i-HSlC (similar to ^i-LSMI, but the objective function is replaced with 
HSIC(diag(u;)X, Y)) . 

• /"i-LSMI^ (proposed method). 

• mRMR (Minimum Redundancy Maximum Relevance) [Peng et al., 2005]. mRMR is one 
of the state-of-the-art algorithms which selects features by solving 

$$\underset{\mathcal{J} \subset \{1,\ldots,m\}}{\text{maximize}} \quad \underbrace{\frac{1}{k}\sum_{i \in \mathcal{J}} I(X_i, Y)}_{\text{relevancy measure}} \; - \; \underbrace{\frac{1}{k^2}\sum_{i \in \mathcal{J}}\sum_{j \in \mathcal{J}} I(X_i, X_j)}_{\text{redundancy measure}} \quad \text{subject to } |\mathcal{J}| = k.$$

That is, it uses mutual information to select relevant features which are not too redundant. 
mRMR solves the optimization problem by greedily adding one feature at a time until k 
features are obtained. This scheme is similar to a forward search algorithm. 

• QPFS (Quadratic Programming Feature Selection) [Rodriguez-Lujan et al., 2010]. QPFS 
formulates the feature selection task as a quadratic programming problem of the form: 

$$\underset{w \in \mathbb{R}^m}{\text{minimize}} \quad \frac{1}{2}(1 - \alpha)\, w^\top Q w - \alpha f^\top w \quad \text{subject to } \mathbf{1}^\top w = 1, \; w \ge 0,$$

where $0 \le \alpha \le 1$ controls the trade-off between high relevancy (high $\alpha$) and low 
redundancy of the selected features. $Q = [q_{ij}]$ with $q_{ij} = |\rho(X_i, X_j)|$, the absolute value of the 
Pearson correlation between $X_i$ and $X_j$ as in Eq. (3), and $f = [f_i]$ with $f_i = |\rho(X_i, Y)|$. In the case 
that Y is categorical, the correlation for categorical variables as in [Hall, 2000] is used. In this 
experiment, we use the recommended value $\alpha = \bar{q}/(\bar{q} + \bar{f})$, where 
$\bar{q} = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} q_{ij}$ and $\bar{f} = \frac{1}{m}\sum_{i=1}^{m} f_i$ 
[Rodriguez-Lujan et al., 2010]. Notice that if $\alpha = 1$, QPFS reduces to PC. 

¹ Matlab implementation of $\ell_1$-LSMI is available at http://wittawat.com/software/l1lsmi/ 
• Lasso [Tibshirani, 1996]. Lasso is a well-known least-squares method which imposes 
an $\ell_1$-norm constraint on the weight vector. Specifically, it solves a problem of the form: 

$$\underset{w \in \mathbb{R}^m}{\text{minimize}} \quad \|Y - w^\top X\|_2^2 + \lambda \|w\|_1,$$

where $\lambda > 0$ is the sparseness regularization parameter. In this experiment, $\lambda$ is varied so 
that k features are obtained. 

• Relief [Kira and Rendell, 1992, Kononenko, 1994]. Relief is another state-of-the-art 
heuristic algorithm which scores each feature based on how it can discriminate differ- 
ent classes (distance-based). 
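
The greedy scheme described for mRMR above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes the mutual information values have already been estimated and are passed in as `relevance` (one value per feature) and `redundancy` (a symmetric matrix), and it uses the common incremental form of the criterion, where the redundancy of a candidate is averaged over the features selected so far. The function name and arguments are hypothetical.

```python
def mrmr_greedy(relevance, redundancy, k):
    """Greedily pick k feature indices, trading relevance against redundancy.

    relevance[j]     : precomputed estimate of I(X_j, Y)       (assumed given)
    redundancy[i][j] : precomputed estimate of I(X_i, X_j)     (assumed given)
    """
    selected = []
    remaining = set(range(len(relevance)))
    while len(selected) < k and remaining:
        def score(j):
            if not selected:
                return relevance[j]  # first step: relevance only
            # mean redundancy of candidate j with the already selected features
            mean_red = sum(redundancy[j][s] for s in selected) / len(selected)
            return relevance[j] - mean_red
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Features 0 and 1 are both relevant but nearly duplicates of each other;
# feature 2 is less relevant but not redundant with either.
relevance = [0.90, 0.85, 0.30]
redundancy = [[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.0, 0.0]]
chosen = mrmr_greedy(relevance, redundancy, k=2)
```

In this toy input the second pick is feature 2 rather than feature 1, because the redundancy penalty outweighs the higher relevance of feature 1. Since each step scores candidates one at a time, a pair of features that is only jointly informative cannot be favored by this scheme.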

6.2 Toy Data Experiment 

An experiment is conducted on the following three toy datasets: 

1. and-or 

• Binary classification (4 true / 6 distracting features). 

• $Y = (X_1 \wedge X_2) \vee (X_3 \wedge X_4)$. 

• $X_1, \ldots, X_7 \sim \mathrm{Bernoulli}(0.5)$, where $\mathrm{Bernoulli}(p)$ denotes the Bernoulli distribution 
taking value 1 with probability p. 

• $X_8, \ldots, X_{10} = Y$ with 0.2 chance of bit flip. 

• Characteristics: Feature redundancy and weak interaction. 

2. quad 

• Regression (2 true / 8 distracting features). 

• $X_1, \ldots, X_8, \epsilon \sim N(0, 1)$, where $N(\mu, \sigma^2)$ denotes the normal distribution with mean 
$\mu$ and variance $\sigma^2$. 

• $X_9 \sim 0.5X_1 + U(-1, 1)$, where $U(a, b)$ is the uniform distribution on [a, b]. 

• $X_{10} \sim 0.5X_2 + U(-1, 1)$. 

• Characteristic: Non-linear dependency. 

3. xor 

• Binary classification (2 true / 8 distracting features). 

• $Y = \mathrm{xor}(X_1, X_2)$, where $\mathrm{xor}(X_1, X_2)$ denotes the XOR function for $X_1$ and $X_2$. 

• $X_1, \ldots, X_5 \sim \mathrm{Bernoulli}(0.5)$. 

• $X_6, \ldots, X_{10} \sim \mathrm{Bernoulli}(0.75)$. 

• Characteristic: Feature interaction. 

Table 4: Averaged F-measures on the and-or, quad, and xor datasets. 

Dataset   PC            F-HSIC        F-LSMI        B-HSIC        B-LSMI 
and-or    [illegible]   [illegible]   [illegible]   [illegible]   [illegible] 
quad      0.57 (.20)    0.95 (.15)    1.00 (.00)    0.95 (.15)    1.00 (.00) 
xor       0.25 (.31)    0.52 (.50)    0.53 (.50)    1.00 (.00)    1.00 (.00) 

Dataset   $\ell_1$-HSIC   $\ell_1$-LSMI   mRMR          QPFS          Lasso         Relief 
and-or    0.25 (.00)      1.00 (.00)      0.25 (.00)    0.41 (.17)    0.21 (.09)    0.55 (.15) 
quad      0.64 (.23)      1.00 (.00)      1.00 (.00)    0.64 (.23)    0.66 (.25)    1.00 (.00) 
xor       1.00 (.00)      1.00 (.00)      0.28 (.31)    0.25 (.32)    0.26 (.32)    1.00 (.00) 
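
As a concrete reading of the and-or and xor definitions in this subsection, the following sketch samples both datasets. It assumes that all features are drawn independently and that each of the redundant and-or features is flipped independently of the others; the text does not state either point explicitly. The function names are hypothetical.

```python
import random


def make_and_or(n, rng):
    """Sample n points of the and-or dataset; returns (X, y) with 10 binary features."""
    X, y = [], []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(7)]        # X1..X7 ~ Bernoulli(0.5)
        label = (x[0] & x[1]) | (x[2] & x[3])             # Y = (X1 and X2) or (X3 and X4)
        # X8..X10 are copies of Y, each flipped with probability 0.2
        # (independent flips are an assumption of this sketch)
        x += [label ^ (rng.random() < 0.2) for _ in range(3)]
        X.append(x)
        y.append(label)
    return X, y


def make_xor(n, rng):
    """Sample n points of the xor dataset; returns (X, y) with 10 binary features."""
    X, y = [], []
    for _ in range(n):
        x = [int(rng.random() < 0.5) for _ in range(5)]   # X1..X5 ~ Bernoulli(0.5)
        x += [int(rng.random() < 0.75) for _ in range(5)] # X6..X10 ~ Bernoulli(0.75)
        X.append(x)
        y.append(x[0] ^ x[1])                             # Y = xor(X1, X2)
    return X, y


rng = random.Random(0)
X_andor, y_andor = make_and_or(400, rng)
X_xor, y_xor = make_xor(400, rng)
```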

The number of features to select, k, is set to the number of true features in the respective 
dataset. For LSMI-based methods, Gaussian kernels are used as the basis functions and b is 
set to 100. Five-fold cross validation is carried out on a grid of $(\sigma, \lambda)$ candidates for model 
selection. For $\sigma$, the candidates are also adaptively scaled with the median of pairwise sample 
distances $\sigma_{\mathrm{med}}$, which depends on the currently selected features: 

$$\sigma_{\mathrm{med}} = \mathrm{median}(\{\|x_i - x_j\|_2\}_{i<j}).$$

Gaussian kernels are also used in HSIC-based methods. However, since model selection is not 
available for HSIC, in F-HSIC and B-HSIC, the Gaussian width is heuristically set to $\sigma_{\mathrm{med}}$ 
[Scholkopf and Smola, 2002]. For $\ell_1$-HSIC, the Gaussian width is adaptively set to the median 
of pairwise distances of $\mathrm{diag}(w)X$ every five iterations. Due to the non-convexity of the objective 
functions, $\ell_1$-LSMI and $\ell_1$-HSIC are restarted 20 times with randomly chosen initial points. 
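
The median heuristic for the Gaussian width can be sketched in a few lines. This is an illustration only; the helper name and the `selected` argument (indices of the currently selected features) are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def median_pairwise_distance(points, selected):
    """Median Euclidean distance over all pairs i < j, using only the selected features."""
    sub = [[p[j] for j in selected] for p in points]
    return median(math.dist(a, b) for a, b in combinations(sub, 2))


points = [[0, 0], [3, 4], [6, 8]]
width_one_feature = median_pairwise_distance(points, [0])      # distances 3, 6, 3
width_two_features = median_pairwise_distance(points, [0, 1])  # distances 5, 10, 5
```

Because adding a coordinate can never shrink a Euclidean distance, the value computed on two features is at least as large as the value computed on one, which is why the width depends on the currently selected features.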

The experiment is repeated 50 times with n = 400 points sampled in each trial. For each 
method and each dataset, an average of the F-measure over all trials is reported. The F-measure 
is defined as $F = 2pr/(p + r)$, where 

• p = (number of correctly selected features) / (number of selected features). 

• r = (number of correctly selected features) / (number of correct features). 

An F-measure is bounded between 0 and 1, and 1 is achieved if and only if all the true features 
are selected and none of the distracting features is selected. The results are shown in Table 4. 
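
The F-measure defined above can be computed directly from the selected and true index sets. A minimal sketch follows; the function name is hypothetical, and returning 0 when no selected feature is correct is a convention chosen here to avoid division by zero.

```python
def f_measure(selected, true_features):
    """F = 2pr / (p + r) for a set of selected feature indices."""
    selected, true_features = set(selected), set(true_features)
    correct = len(selected & true_features)
    if correct == 0:
        return 0.0  # convention of this sketch: p = r = 0 gives F = 0
    p = correct / len(selected)       # precision
    r = correct / len(true_features)  # recall
    return 2 * p * r / (p + r)


# and-or has four true features (indices 0..3); indices 7..9 are the redundant ones.
perfect = f_measure([0, 1, 2, 3], [0, 1, 2, 3])
one_true_three_redundant = f_measure([7, 8, 9, 0], [0, 1, 2, 3])
```

With k equal to the number of true features, p and r coincide, so selecting the three redundant features plus one true feature in the and-or problem yields an F-measure of 0.25.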

PC ranks the relevance of each feature individually without taking into account the redundancy 
among features. This results in a failure on the and-or dataset since $X_8, \ldots, X_{10}$, which 
are redundant, are simply ranked at the top due to their similarity to Y. 
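
As a rough check of this ranking, the population Pearson correlations in the and-or problem can be worked out from the dataset definition, assuming that the bit flips are independent of all other variables (an assumption of this calculation, not stated in the text). With $p = P(Y = 1)$:

$$p = 1 - \left(\tfrac{3}{4}\right)^2 = \tfrac{7}{16}, \qquad \mathrm{Var}(Y) = p(1 - p) = \tfrac{63}{256}.$$

For a true feature such as $X_1$, $P(Y = 1 \mid X_1 = 1) = 1 - \tfrac{1}{2}\cdot\tfrac{3}{4} = \tfrac{5}{8}$, so

$$\mathrm{Cov}(X_1, Y) = \tfrac{1}{2}\cdot\tfrac{5}{8} - \tfrac{1}{2}\cdot\tfrac{7}{16} = \tfrac{3}{32}, \qquad \rho(X_1, Y) = \frac{3/32}{\sqrt{1/4}\,\sqrt{63/256}} = \frac{3}{\sqrt{63}} \approx 0.38.$$

For a redundant feature such as $X_8$, $E[X_8] = 0.8p + 0.2(1 - p) = 0.4625$ and

$$\mathrm{Cov}(X_8, Y) = 0.8p - 0.4625\,p = 0.6\,p(1 - p), \qquad \rho(X_8, Y) = \frac{0.6\,p(1 - p)}{\sqrt{0.4625 \cdot 0.5375}\,\sqrt{p(1 - p)}} \approx 0.60.$$

Under these assumptions each of $X_8, \ldots, X_{10}$ is individually more correlated with Y than any of the true features, so a ranking by correlation places all three of them first.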

The forward search variants do not work on problems with feature interaction. To detect 
interacting features, it is necessary that all features be considered simultaneously. For this 
reason, F-HSIC and F-LSMI fail in the xor problem. 
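
The difficulty with xor can be illustrated numerically. The sketch below uses the ordinary plug-in estimate of mutual information for discrete variables, not LSMI or HSIC, purely to show the effect; the function name and sample size are choices made for this illustration.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate (in nats) of the mutual information of two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) p(y)))
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


rng = random.Random(0)
x1 = [rng.randint(0, 1) for _ in range(4000)]
x2 = [rng.randint(0, 1) for _ in range(4000)]
y = [a ^ b for a, b in zip(x1, x2)]

mi_single = mutual_information(x1, y)                # close to 0: X1 alone says nothing about Y
mi_pair = mutual_information(list(zip(x1, x2)), y)   # close to log 2: the pair determines Y
```

A forward search scores the first feature on its own, so in this setting the true features look no better than the distracting ones at the first step.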
The performance of HSIC-based methods seems to be unstable in many cases. A possible 
cause of the instability is the use of an inappropriate parameter: the heuristic of using $\sigma_{\mathrm{med}}$ 
for the Gaussian width does not always work. As an example, given a fixed data matrix X, the 
more features are selected, the larger $\sigma_{\mathrm{med}}$ may become. This is because the Euclidean distance is a 
non-decreasing function of the dimension. So, inclusion of many irrelevant features 
unnecessarily makes $\sigma_{\mathrm{med}}$ larger. B-HSIC is subject to this weakness since it starts the search 
with all features. 

B-LSMI performs well in detecting non-linear dependency (quad) and feature interaction 
(xor). However, due to its greedy nature, the redundant features in the and-or problem are 
sometimes chosen. That is, in the first few iterations, all redundant features are kept, and one of 
the true features is eliminated instead. 

mRMR and QPFS have similar optimization strategies. That is, both of them measure the 
relevancy of each feature, and have a pairwise feature redundancy constraint. Regardless of 
the feature measure in use, considering features in a univariate way cannot reveal interacting 
features (by definition of feature interaction). Therefore, it is not surprising that both of them 
fail on the xor and and-or datasets. Nevertheless, mRMR works well on the quad dataset 
since mutual information can reveal a non-linear dependency. On the other hand, QPFS and 
Lasso do not perform well on the quad dataset since both of them use a linear measure. 

Relief is one of the few feature ranking algorithms which can consider feature interaction 
(the xor dataset) because of its distance-based nature. However, it suffers the same drawback 
as other ranking algorithms in that no redundancy is considered. Hence, it fails on the and-or 
dataset for the same reason as PC. 

The proposed $\ell_1$-LSMI performs well on all datasets. This shows that $\ell_1$-LSMI 
can consider redundancy, detect non-linear dependency, and consider feature interaction. 
$\ell_1$-based feature optimization enables a simultaneous consideration of features, which is the key in 
tackling the feature interaction problem. By using $\ell_1$-regularization in combination with SMI, 
which can detect a non-linear dependency, $\ell_1$-LSMI can correctly choose the two true features 
in the quad problem. For the and-or problem, the pitfall is to choose $X_8, \ldots, X_{10}$ because of 
their high correlation to Y. However, due to the usage of $\ell_1$-regularization, $\ell_1$-LSMI attempts to 
find the four-feature subset which maximizes LSMI in a non-greedy manner. Since $X_8, \ldots, X_{10}$ 
contain bit-flip noise, inclusion of any of them will not deliver the maximum LSMI. In this case, 
the only four features which give the maximum LSMI are $X_1, \ldots, X_4$, and thus they are preferred over 
any of $X_8, \ldots, X_{10}$. 

As an illustration of LSMI, Table 5 shows all possible 35 four-feature subsets of 
$\{X_1, \ldots, X_4\} \cup \{X_8, \ldots, X_{10}\}$ in the and-or problem and their corresponding LSMI values. It is 
evident that the correct subset $\{X_1, \ldots, X_4\}$ has the highest LSMI. Inclusion of any of $X_8, \ldots, X_{10}$ 
(and thus removal of some of $\{X_1, \ldots, X_4\}$) would cause a significant drop of the LSMI value. In 
the extreme case, with all of $X_8, \ldots, X_{10}$ in the selected set (shown at the bottom of the table), the 
LSMI score becomes considerably low. This is because each of $X_8, \ldots, X_{10}$ contains roughly 
the same information to explain Y. Thus, there is no gain in adding more features which share 
very similar information. 
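
The count of 35 subsets follows from choosing 4 of the 7 candidate features. A short sketch that enumerates them (variable names are illustrative):

```python
from itertools import combinations

# The four true features and the three redundant features of the and-or problem.
candidates = [1, 2, 3, 4, 8, 9, 10]
subsets = list(combinations(candidates, 4))

# Subsets that contain all three redundant features X8, X9, X10.
all_redundant = [s for s in subsets if {8, 9, 10} <= set(s)]
```

Exactly four of the subsets contain all of the redundant features, one for each choice of the single remaining true feature.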
Table 5: All possible 35 four-feature subsets of $\{X_1, \ldots, X_4\} \cup \{X_8, \ldots, X_{10}\}$ in the and-or 
dataset, and their corresponding values of LSMI to the output $Y = (X_1 \wedge X_2) \vee (X_3 \wedge X_4)$. 

Feature indices   LSMI          Feature indices   LSMI 
1 2 3 4           [illegible]   1 4 9 10          [illegible] 
1 2 3 8           [illegible]   2 3 4 8           [illegible] 
1 2 3 9           [illegible]   2 3 4 9           [illegible] 
1 2 3 10          [illegible]   2 3 4 10          [illegible] 
1 2 4 8           [illegible]   2 3 8 9           0.341 
1 2 4 9           [illegible]   2 3 8 10          [illegible] 
1 2 4 10          [illegible]   2 3 9 10          [illegible] 
1 2 8 9           [illegible]   2 4 8 9           [illegible] 
1 2 8 10          [illegible]   2 4 8 10          [illegible] 
1 2 9 10          0.336         2 4 9 10          0.328 
1 3 4 8           0.382         3 4 8 9           0.356 
1 3 4 9           0.376         3 4 8 10          0.349 
1 3 4 10          0.392         3 4 9 10          0.353 
1 3 8 9           0.325         1 8 9 10          0.330 
1 3 8 10          0.330         2 8 9 10          0.334 
1 3 9 10          0.333         3 8 9 10          0.303 
1 4 8 9           0.342         4 8 9 10          0.335 
1 4 8 10          [illegible] 



6.3 Real-Data Experiment 

To demonstrate the practical use of the proposed $\ell_1$-LSMI, we conduct experiments on real 
datasets that are not restricted to any specific domain. All the real datasets used in the experiments 
are summarized in Table 6. The "Task" column denotes the type of the problem (R for regression, and 
Cx for an x-class classification problem). The datasets cover a wide range of domains including 
image, speech, and bioinformatics. 

The experiment is repeated 20 times with n = 400 points sampled in each trial. In each 
trial, k is varied in the low range with a step size proportional to the entire dimensionality m. 
For classification, each selected k-feature subset is scored with the test error of a support vector 
classifier (SVC) with Gaussian kernels. For regression, the root mean squared error of support 
vector regression (SVR) with Gaussian kernels is used. The hyper-parameters of SVC and SVR 
are chosen with cross validation. We use the implementations of SVC and SVR given in the 
LIBSVM library [Chang and Lin, 2001]². The results are shown in Fig. 1. 

Overall, results suggest that using LSMI can give better features than HSIC (judged by the 
error of SVC/SVR). This shows the importance of the availability of a model selection criterion. 
$\ell_1$-LSMI and mRMR are competitive, especially on multi-class classification problems 
with many classes (e.g., segment and satimage). This is in contrast to PC and Relief which do 

² LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 






Table 6: Summary of the real datasets used in the experiments. 

Dataset      m     n             Task   Class balance (%) 
abalone      8     4177          R 
bcancer      9     277           C2     70.8/29.2 
cpuact       21    [illegible]   R 
ctslices     379   [illegible]   R 
flaresolar   9     1066          C2     44.7/55.3 
german       20    1000          C2     70.0/30.0 
glass        9     214           C6     32.7/35.5/7.9/6.1/4.2/13.6 
housing      13    [illegible]   R 
image        18    [illegible]   C2     42.9/57.1 
ionosphere   33    351           C2     64.1/35.9 
isolet       617   6238          C26    about 3.85% per class 
msd          90    10000         R 
musk1        166   476           C2     56.5/43.5 
musk2        166   6598          C2     84.6/15.4 
satimage     36    6435          C6     23.8/10.9/21.1/9.7/11.0/23.4 
segment      18    2310          C7     14.3% per class 
senseval2    50    534           C3     33.3% per class 
sonar        60    208           C2     46.6/53.4 
spectf       44    267           C2     20.6/79.4 
speech       50    400           C2     50.0/50.0 
vehicle      18    846           C4     25.1/25.7/25.8/23.5 
vowel        13    990           C11    9.1% per class 
wine         13    178           C3     33.1/39.9/27.0 

All datasets were taken from the UCI Machine Learning Repository: 
http://archive.ics.uci.edu/ml/, except that cpuact is from 
http://mldata.org/repository/data/viewslug/uci-20070111-cpu_act/, 
SENSEVAL-2 is from the Second International Workshop on Evaluating Word Sense 
Disambiguation Systems: http://www.sle.sharp.co.uk/senseval2, and speech is our 
in-house developed voice dataset. 

not handle multi-class problems well. As in the case of the toy data experiment, PC does not 
perform well in most cases since it does not take redundancy among features into account. An 
exception is the senseval2 problem, in which PC performs the best among all methods. This 
is because the 50 features in the senseval2 dataset are derived from the first 50 principal 
components obtained by principal component analysis. Since principal components are orthogonal by 
definition, no redundancy has to be considered for this problem. In some cases, considering 
feature redundancy may hurt the performance. This can be seen on the image, cpuact, senseval2, 
and musk2 datasets, where PC outperforms QPFS, suggesting that features may not be correlated. 
Table 7: SVC/SVR errors of the features selected by PC, F-HSIC, F-LSMI, $\ell_1$-HSIC, $\ell_1$-LSMI, 
mRMR, QPFS, Lasso, and Relief on real datasets. 

Dataset           m    n    k   PC           F-HSIC       F-LSMI       $\ell_1$-HSIC  $\ell_1$-LSMI  mRMR         QPFS         Lasso        Relief 
abalone (R)       8    400  4   0.73 (.04)   0.74 (.04)   0.70 (.05)   0.73 (.04)     0.70 (.05)     0.73 (.05)   0.75 (.04)   0.70 (.04)   0.69 (.04) 
bcancer (C2)      9    277  4   0.24 (.00)   0.24 (.00)   0.23 (.01)   0.23 (.00)     0.23 (.01)     0.25 (.00)   0.23 (.00)   0.24 (.00)   0.26 (.00) 
glass (C6)        9    214  4   0.29 (.00)   0.28 (.00)   0.30 (.01)   0.30 (.01)     0.30 (.01)     0.30 (.00)   0.29 (.00)   -            0.31 (.00) 
housing (R)       13   400  4   4.03 (.19)   4.14 (.20)   4.20 (.21)   3.95 (.20)     3.91 (.19)     3.97 (.20)   4.11 (.23)   4.14 (.27)   4.10 (.21) 
vowel (C11)       13   400  4   0.20 (.02)   0.23 (.03)   0.24 (.03)   0.20 (.02)     0.21 (.02)     0.20 (.02)   0.20 (.02)   -            0.21 (.02) 
wine (C3)         13   178  4   0.03 (.00)   0.03 (.00)   0.03 (.01)   0.03 (.01)     0.03 (.01)     0.03 (.00)   0.03 (.00)   -            0.03 (.00) 
image (C2)        18   400  4   0.10 (.01)   0.19 (.03)   0.17 (.03)   0.13 (.03)     0.06 (.02)     0.14 (.02)   0.11 (.02)   0.11 (.02)   0.05 (.01) 
segment (C7)      18   400  4   0.19 (.03)   0.24 (.03)   0.17 (.02)   0.11 (.03)     0.05 (.01)     0.05 (.01)   0.08 (.03)   -            0.13 (.02) 
vehicle (C4)      18   400  4   0.32 (.02)   0.33 (.03)   0.28 (.02)   0.34 (.03)     0.27 (.02)     0.39 (.05)   0.39 (.05)   -            0.32 (.04) 
german (C2)       20   400  4   0.25 (.02)   0.29 (.01)   0.29 (.02)   0.25 (.02)     0.25 (.02)     0.25 (.02)   0.25 (.02)   0.25 (.02)   0.26 (.02) 
cpuact (R)        21   400  4   0.25 (.03)   0.33 (.12)   0.28 (.07)   0.54 (.31)     0.25 (.16)     0.23 (.06)   0.27 (.04)   0.26 (.04)   0.37 (.09) 
ionosphere (C2)   33   351  4   0.07 (.00)   0.07 (.00)   0.08 (.01)   0.07 (.00)     0.07 (.00)     0.09 (.00)   0.07 (.00)   0.07 (.00)   0.07 (.00) 
satimage (C6)     36   400  10  0.22 (.02)   0.14 (.01)   0.13 (.02)   0.14 (.02)     0.13 (.02)     0.14 (.01)   0.14 (.02)   -            0.16 (.02) 
spectf (C2)       44   267  10  0.19 (.00)   0.17 (.00)   0.17 (.01)   0.19 (.01)     0.17 (.01)     0.18 (.00)   0.18 (.00)   0.18 (.00)   0.18 (.00) 
senseval2 (C3)    50   400  10  0.18 (.01)   0.18 (.01)   0.18 (.02)   0.19 (.02)     0.18 (.01)     0.18 (.01)   0.18 (.01)   -            0.21 (.01) 
speech (C2)       50   400  10  0.01 (.00)   0.01 (.00)   0.01 (.00)   0.01 (.00)     0.01 (.00)     0.02 (.00)   0.01 (.00)   0.01 (.00)   0.03 (.00) 
sonar (C2)        60   400  10  0.23 (.00)   0.22 (.00)   0.14 (.02)   0.21 (.02)     0.16 (.02)     0.18 (.00)   0.19 (.00)   0.16 (.00)   0.19 (.00) 
msd (R)           90   400  10  0.95 (.06)   0.94 (.06)   0.92 (.06)   0.94 (.06)     0.93 (.06)     0.97 (.06)   0.94 (.06)   0.92 (.06)   0.96 (.06) 
musk1 (C2)        166  400  20  0.19 (.02)   0.17 (.02)   0.14 (.02)   0.16 (.02)     0.16 (.02)     0.15 (.02)   0.18 (.02)   0.13 (.01)   0.19 (.03) 
musk2 (C2)        166  400  20  0.09 (.01)   0.08 (.01)   0.07 (.01)   0.09 (.01)     0.08 (.01)     0.09 (.01)   0.09 (.02)   0.07 (.01)   0.09 (.01) 
ctslices (R)      379  400  20  0.79 (.07)   -            -            0.64 (.05)     0.60 (.07)     0.45 (.04)   0.46 (.02)   0.41 (.03)   0.56 (.05) 
isolet (C26)      617  400  20  0.54 (.03)   -            -            0.36 (.04)     0.27 (.03)     0.30 (.03)   0.30 (.03)   -            0.49 (.03) 
Top Count                       3            2            7            1              11             3            1            4            2 



Thus, ignoring redundancy and considering just relevancy gives a better performance. $\ell_1$-HSIC 
performs well in many cases, but the performance may become unstable when k is high due to 
the aforementioned fact that $\sigma_{\mathrm{med}}$ also gets larger. 

To compare the performance objectively, another experiment with the same setting is carried 
out on 22 datasets. The number of trials is set to 50. For each method and dataset, k is set to 
either 4, 10, or 20 depending on how large m is. The selected k-feature subsets are evaluated by 
SVC or SVR, as in the previous experiment. The results are given in Table 7, where for each 
dataset, the method with the best performance is shown in bold face. Other methods whose 
performance does not differ significantly from the best one (based on the one-sided paired t-test 
at the 5% significance level) are also marked in the same way. Note that Lasso works only on binary 
and regression problems. Thus, the results for multi-class problems are not available. For 
F-HSIC and F-LSMI, we omit the results on the ctslices and isolet datasets due to the large 
computation time involved. 
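
The significance check mentioned above can be sketched with the standard library alone. This is an illustration of a one-sided paired t-test on per-trial errors, not the code used for Table 7; the function names are hypothetical, and the critical value must be supplied by the caller (for 50 trials, i.e. 49 degrees of freedom, the 5% one-sided value from a standard t-table is approximately 1.677).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """t statistic of the paired differences errors_a - errors_b (one pair per trial).

    Assumes the differences are not all identical (otherwise the standard
    deviation is zero and the statistic is undefined).
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def significantly_worse(errors_a, errors_b, critical_value):
    """One-sided test: is method A's error significantly larger than method B's?"""
    return paired_t_statistic(errors_a, errors_b) > critical_value


# Made-up per-trial errors for two methods, for illustration only.
errors_a = [0.30, 0.32, 0.31, 0.33, 0.29]
errors_b = [0.25, 0.26, 0.27, 0.26, 0.25]
t_value = paired_t_statistic(errors_a, errors_b)
# 2.132 is the 5% one-sided critical value for 4 degrees of freedom.
a_worse = significantly_worse(errors_a, errors_b, critical_value=2.132)
```

A method would be marked together with the best one when the test fails to show that its error is significantly larger than that of the best method.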

From the table, it can be seen quantitatively that overall $\ell_1$-LSMI performs the best, judging 
from the number of times it ranks top. Interestingly, although worse on small datasets, 
the performance of mRMR approaches that of $\ell_1$-LSMI on high-dimensional datasets (i.e., the 
musk1, musk2, ctslices, and isolet datasets). One reasonable explanation for this 
phenomenon is that a large number of features provides more freedom in choosing an alternative 
subset. Even though there are interacting features, there may be many other alternative 
non-interacting subsets which give an almost equivalent explanatory power. For this reason, the fact 
that mRMR cannot detect interacting features may be less significant. 
Figure 1: Comparison of SVC/SVR errors of features selected by PC, $\ell_1$-HSIC, $\ell_1$-LSMI, 
mRMR, QPFS, Lasso and Relief. The horizontal axis of each panel is the number of features. 
Panels: (a) image (m=18, n=400, 2 classes); (b) german (m=20, n=400, 2 classes); 
(c) cpuact (m=21, n=400, regression); (d) segment (m=18, n=400, 7 classes); 
(e) wine (m=13, n=178, 3 classes); flaresolar (m=9, n=400, 2 classes); 
spectf (m=44, n=267, 2 classes); satimage (m=36, n=400, 6 classes). 
7 Conclusion 



Feature selection is an important dimensionality reduction technique which can help improve 
the prediction performance and speed, and facilitate the interpretation of a learned predictive 
model. There are a number of factors which cause the difficulty of feature selection. These 
include non-linear dependency, feature redundancy, and feature interaction. 

The proposed $\ell_1$-LSMI is an $\ell_1$-based algorithm that maximizes SMI between the selected 
features and the output. The main idea is to learn a sparse feature weight vector whose 
coefficients can be used to determine the importance of features. Only features corresponding to 
the non-zero coefficients in the weight vector need to be kept. The use of $\ell_1$-regularization 
allows simultaneous consideration of features, which is essential in detecting a group of 
interacting features. By combining it with SMI, which is able to detect a non-linear dependency and 
implicitly handle feature redundancy, a powerful feature selection algorithm is obtained. 

Extensive experiments were conducted to confirm the usefulness of $\ell_1$-LSMI. We therefore 
conclude that $\ell_1$-LSMI is a promising method for practical use. 